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ABSTRACT 

COSMOGRAIL is a long-term photometric monitoring of gravitationally lensed quasars aimed at implementing Refsdal’s time- 
delay method to measure cosmological parameters, in particular Hq. Given the long and well sampled light curves of strongly lensed 
quasars, time-delay measurements require numerical techniques whose quality must be assessed. To this end, and also in view of future 
monitoring programs or surveys such as the LSST, a blind signal processing competition named Time Delay Challenge 1 (TDCl) 
was held in 2014. The aim of the present paper, which is based on the simulated light curves from the TDCl, is double. First, we 
test the performance of the time-delay measurement techniques currently used in COSMOGRAIL. Second, we analyse the quantity 
and quality of the harvest of time delays obtained from the TDCl simulations. To achieve these goals, we first discover time delays 
through a careful inspection of the light curves via a dedicated visual interface. Our measurement algorithms can then be applied 
to the data in an automated way. We show that our techniques have no significant biases, and yield adequate uncertainty estimates 
resulting in reduced values between 0.5 and 1.0. We provide estimates for the number and precision of time-delay measurements 
that can be expected from future time-delay monitoring campaigns as a function of the photometric signal-to-noise ratio and of the 
true time delay. We make our blind measurements on the TDCl data publicly available. 
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1. Introduction 


The methods used to constrain current cosmological models all 
benefit from independent measurements of the local value of the 
Hubble parameter, Hq (see e.g. Fig. 48 of [Weinberg et al.|2013| ). 
One way of achieving a measurement of Hq is based on time de¬ 
lays in strong gravitational lens systems. The method, first sug¬ 
gested by |Refsd Jj 1 964| ), proposes measuring the differences in 
the travel time of photons coming from multiple images of a dis¬ 
tant source, such as a quasar. This time delay. At, is connected to 
the overall matter distribution responsible for the lensing effect, 
and to the time-delay distance Da? to the lens, i.e. Hq, with some 
sensitivity to curvature and dark energy as well (e.g.,|Suyu et al.l 
| 2 ^ . 

Exploiting this relationship to constrain Hq and cosmology 
in general requires both an accurate mass mo del for the lens and 
accurate time delay measurements (see e.g., jSuyu et ar]|2Q12[ 
|Linder|20TH|Moustakas et al.|2009| ). Modelling the lens mass in 
an unbiased way is difficult and pro ne to degeneracies known 
as the mass-sheet degeneracy (e.g., [Schneider & Slus^|2Q13| ) 
an d, more generally, the sour ce plane transformation described 
Schneider & Sluse ( [2014[ ). The effect of lens model degen 


eracies can be mitigated by combining astrometric information 
from high resolution imaging, measurements of the lens dynam¬ 
ics, priors on the mass density profile of the lens, and an anal- 


ysis of structures along the line of sight (e.g.,[Suyu et al. 

2014 

Greene et al.|2013[[Fadely et al. 2010 

[Treu & Koopmans 

2002 

Falco et al.[1997[ Keeton & Kochanek 

1997J. Integral fielc 

spec- 


troscopy coupled with the adaptive optics that is becoming avail¬ 
able on the VLT and at the Keck observatory will be essential in 
this respect. One of the key ingredient for the method to work 
at all is, however, the quality of the time-delay measurements, 
which is the focus of the present work. 


In practice, measuring time delays is achievable if the 
lensed source is photometrically variable. Gravitationally lensed 
quasars are ideal targets: the quasars can show variability ac¬ 
cessible by moderately sized ground-based optical telescopes on 
timescales of a few days, while the time delays resulting from 
galaxy-scale lenses are typically of the order of weeks to months 
(see, e.g., Qguri & Marshall|2010[ ). The intrinsic light curve of 
the quasar is seen shifted by the relative time delays in the differ¬ 
ent lensed images. However, this simple situation is often con¬ 
taminated: microlensing by stars in the lensing galaxy introduces 
extrinsic variability in the individual light curves with an ampli¬ 
tude sometimes comparable with that of the intrinsic variability 
of the quasar. To yield competitive cosmological constraints, re¬ 
liable time-delay measurements with percent-level precision are 
needed ( [Treu|201^ . An efficient implementation of these mea¬ 
surements has long been hampered by how difficult it is to obtain 
photometric data for periods of many years at a d esirable ca- 
dence, which must be close to 1 epoch per few days ([Eigenbrod[ 
[et al.|2005[ ). 


COSMOGRAIL is a monitoring program targeting more 
than 30 strongly lensed quasars using meter-class telescopes, 
with a cadence of 3 days for the most interesting systems. Recent 
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results include light curves and time-delay measurements that 
are accurate to within a few percent points in HE 0435-1223 
dCourbin et al.|2011| i, SDSS J lOOl+5027 dRathna Kumar et al.| 
|2013| l and in RX J1131-1231 dTewes et al.|2013b| l. To measure 
these time delays, we developed and implemented several algo¬ 
rithms in the form of a COSMOGRAIL curve-shifting toolbox 
named PyCS, described in |Tewes et al.| ( |2013a| ). 

In the fall of 2013, a first blind time-delay measurement 
competition named Time Delay Challenge 1 (T DCl) was pro¬ 
posed to the community by |Dobler et al. ( |2015| ). The two main 
goals of this open challenge were (1) to homogeneously assess 
the quality of time-delay measurement algorithms on a common 
set of realistic synthetic light curves, and (2) to obtain some 
quantitative informations on the impact of observing strategy 
(cadence, season length, campaign length) on time-delay mea¬ 
surements. We took part in this challenge a nd submitted time- 
delay estimates using the team name PyCS. |Liao et ah] ( [2Q15| ) 
give a summary of the results from all TDCl participating teams, 
as well as so me general conclu sions. The present paper is com¬ 
plementary to |Liao et ar] ( |2Q15| ). It focuses on the PyCS methods 
that we also apply to real light curves, and hence assesses the 
quality and reliability of the COSMOGRAIL time-delay mea¬ 
surements. 

To evaluate our existing techniques with the thousands of 
light curves of TDCl under conditions similar to the analysis of 
COSMOGRAIL data, we separated the problem of time-delay 
measurement of a light curve pair into two successive stages. 


Stage I: we first attempt to discover a plausible time delay, with¬ 
out trying to measuring it precisely. We also evaluated how 
confident we were that the proposed approximate solution 
is close to the true time delay and not a catastrophic fail¬ 
ure. Owing to the limited length of the monitoring seasons, 
the randomness of quasar variability, noise and microlens- 
ing, this was not possible for every light curve pair of TDCl 
or a real monitoring campaign. We note that in the case of 
TDCl we had no prior information on the time delay to look 
for, as we had no knowledge of the mass distribution in the 
lensing galaxy. Only the light curves themselves were used. 

Stage II: for those systems for which Stage I was successful, 
we then focused on accurately estimating the time delay and 
associated uncertainty with the PyCS techniques, constrain¬ 
ing the algorithms to a delay range around the solution from 
Stage I. As the PyCS Stage II methods did not rely on a phys¬ 
ical model of the light curves, they would not be able to deal 
adequately with comparing odds among very different solu¬ 
tions. 


This two-stage structure is of general interest beyond PyCS, as 
the stages concern discernible aspects of the time-delay mea¬ 
surement problem. Stage I deals with the quantity and the purity 
of time-delay measurements, while Stage II deals with their ac¬ 
tual accuracy. 

The paper is structured as follows. Section [^summarizes the 
data from the Time Delay Challenge 1 and the metrics used to 
evaluate techniques. In Sect. [^ we present the approaches we 
took to address Stage I, while Sect. [^ presents the Stage II al¬ 
gorithms. In Sect. [5] we dis cuss the results, expanding on the 
analysis of |Liao et~^ ( |2015 ), and we conclude in Sect.j^ 


2. Time Delay Challenge 1 

The mock data used in this work are synthetic quasar light curves 
made publicly available in the context of the the Time Delay 


Challenge 1 (TDCl) proposed by |Dobler et al.| ( [2015| ). These 
data mimic the photometric variations seen in real gravitation¬ 
ally lensed quasars, with different time sampling, number of sea¬ 
sons, and season length. The curves are generated with simple 
yet plausible noise properties, and include microlensing vari¬ 
ability. The dataset is split into five “rungs” or stages that sim¬ 
ulate different observational strategies, each rung consisting of 
1024 pairs of light curves. The rungs randomly mix microlens¬ 
ing, noise properties, and variability but differ in time sampling, 
number of season s, and season len gth. These differences are 
listed in Table 1 of |Liao etar] ( |2015| hereafter TDCl paper). 

Participants to the TDCl were asked to provide their best 
point estimate ISti and associated l-cr uncertainty St for as many 
pairs of curves as possible. The submitted entries to the chal¬ 
lenge were then compared to the true input delays, and evalu¬ 
ated using simple metrics probing different properties of the es¬ 
timates. The details of how the simulations were set up, as well 
as a discussion of these metrics are given in |Dobler et al.| ( [20T5] ). 
Results obtained by the blind submissions of the different par¬ 
ticipating teams are summarized in the TDCl paper, including 
our submissions denoted “PyCS”. Lor completeness, we briefly 
summarize the four main metrics: 


1. The fraction / of submitted time delay estimates, 

/ = A^submit/A^, (1) 


where A^submit is the number of measured time delays and N 
is the total number of curves. _ 

2. The mean between the measured time delay Ati and the 
true value Ati weighted using the estimated uncertainties d/. 


1 ( Ati ~ \ 

Af submit I V / 


3. The mean “claimed” precision P of the time-delay measure¬ 
ments, computed from the estimated uncertainties Si, 


P = 


1 


A^submit 



(3) 


4. The mean accuracy A of the time-delay measurements. 


1 ( Ati ~ Ati 1 

A^submit V Ati ) 


(4) 


To analyse the results in greater detail, we also introduce two 
modified metrics. Lirst, a “blind” precision. 


where we replace in Eq.j^the true value of Ati by its estimation 
Ati. This metric can be computed without knowing of the true 
time delays; its summation terms are useful, for instance to sort 
light curve pairs of a real monitoring survey. Second, we intro¬ 
duce a variant of the accuracy 



where we replace the Ati in the denominator of Eq.j^by its abso¬ 
lute value. While A is sensitive to a bias on the amplitude of Ati 
(i.e., over- or underestimation of delays), Aabs responds to signed 
biases. 



= ^Vp, = ^y 

A^submit 


A^submit 
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Statistical uncertainties on these metrics are computed fol¬ 
lowing 

^err ~ ^x/ submit? C^) 

where (Tx is the sample standard deviation of the summation 
terms of X = P, and A. 


3. Stage I: discovering time deiays 

To apply any of the PyCS time-delay measurement algorithms 
( |Tewes et al.||2013^ to a pair of light curves, a prior estimate 
of the time delay is required. Depending on the considered light 
curves, identifying this delay from the data might be difficult or 
even impossible. In the following, we describe two approaches 
to discover rough time-delay estimates (Stage I). Both methods 
rely solely on the light curves without considering the configura¬ 
tion of the lens system. The first method is based on a visual in¬ 
spection of the light curves and is t he method we use d to blindly 
prepare submissions for the TDCl ( [Liao et al.|2015 ). We devel¬ 
oped the second method after the TDCl results were released. 
We use the data from the challenge to evaluate the relative mer¬ 
its and drawbacks of each method. 


3.1. D3CS; D3 visual curve shifting 


This method is based on visual inspection of the light curves, 
in the spirit of citize n science projects (e.g., see the review by 
Marshall et al.|2014 ). To ease this process, we developed a ded¬ 


icated browser-based visualization interfac e, using the D3. js 
JavaScript librar^by Bostock et al. ( 2011 ). We have made this 
interface public0 

The main motivations behind this time-costly yet simple ap¬ 
proach were (1) to obtain rough initial estimates for the time 
delays and their associated uncertainties, and (2) to estimate 
how confident one can be that the time-delay estimations are 
not catastrophic outliers. Clearly, visual curve-shifting allows 
for more freedom than any automated procedure. It also permits 
dealing in a more flexible way with unusual behaviour of light 
curves, such as highly variable signal-to-noise from one season 
to the next, extreme microlensing, or even when the time delays 
are comparable in length to the visibility period of the object. 

Our interface allows users to interactively shift the light 
curves in time, magnitude, and fiux, and to zoom in on sections 
of the data. It permits the visual estimation of the time delay 
and of an associated uncertainty. Importantly, the interface also 
asks to pick one of four choices of confidence category for the 
proposed solution: 


1. secure: if a catastrophic error can be excluded with a very 
high confidenc^ 

2. plausible: if the solution yields a good fit and no other 
solutions are seen; 

3. multimodal: if the proposed solution is only one among two 
or more possible solutions that are equally satisfactory; 

4. uninformative: if the data does not allow the estimate of 
any time delay. 


Unlike massive crowdsourcing programs (e.g. Galaxy Zoo; 
Lintott et al.||2008| ), only four scientists participated in the vi¬ 
sual inspection of TDCl and each pair of curves was measured 


^ Data-Driven Documents, http: //www. d3 j s. org/ 

^ http://www.astro.uni-bonn.de/~mtewes/d3cs/tdcl/ (See 

“Read me first” for help) _ _ 

^ This same category was named doubtless in Liao et al. (20151 


1200 



[days] 

Fig. 1. Left panel: stacked histogram of the errors made by the 
visual time-delay estimation, for the secure and plausible 
D3CS samples. All the secure estimations are within the dis¬ 
played range of errors of +20 days. Only 2.6% of the time delays 
in the plausible sample have an absolute error larger than 20 
days. Right panel: absolute time-delay error made by D3CS as a 
function of the true delay for both samples. 


by at least two independent scientists. The behaviour of different 
users in terms of time spent per time-delay measurement span 
a broad range. Fast users spend on average 25 seconds per ob¬ 
ject, while slower users spend more than 60 seconds per object. 
This includes the time taken to measure a delay, to estimate a 
Icr uncertainty, and to allocate one of the four confidence levels 
described above. 

To obtain a single Stage I estimation for each light curve 
pair, we reduce the results obtained by all four scientists in a very 
conservative way. We systematically downgrade to multimodal 
the confidence category of pairs with conflicting time-delay es¬ 
timates. 

We define samples combining the four confidence cate¬ 
gories as follows: sec stands for secure only, seep la stands 
for secure + plausible, and secplamul for secure + 
plausible + multimodal. The combination of all estima¬ 
tions is labelled all. 

Figure shows the distribution of the error on the time- 
delay estimation versus the true time delay for the secure and 
plausible D3CS subsamples. Tablesummarizes the results of 
the D3CS classification and displays the fraction of catastrophic 
outliers 6, i.e. time-delay estimations more than 20 days away 
from the truth. Notably, the secure sample contains 1623 time- 
delay estimates free of any catastrophic outliers. 

Through this simple experiment, we have demonstrated that 
such an approach is easily manageable for about 5000 light 
curves. In total the four scientists involved in the visual esti¬ 
mation of the time delays spent 150 hours measuring the 5120 
delays. We note that 30% of the time delays were measured by 
three or more users. 

3.2. Attempt to design an automated Stage I procedure 

Visual inspection of the light curves is a time-consuming process 
that cannot be repeated many times. Designing an automated 
method whose efficiency approaches that of D3CS is therefore 
complementary and would help to minimize the time spent on 
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Table 1. D3CS classification of the TDCl light curve pairs. 


Rung 0 

Estimates 

6 

Secure 

548 (53.5%) 

0% 

Plausible 

291 (28.4%) 

2.1% 

Multimodal 

60 (5.9%) 

30.0% 

Uninformative 

125 (12.2%) 

- 

Rung 1 

Estimates 

6 

Secure 

288 (28.1%) 

0% 

Plausible 

383 (37.4%) 

1.3% 

Multimodal 

127 (12.4%) 

17.3% 

Uninformative 

226 (22.1%) 

- 

Rung 2 

Estimates 

6 

Secure 

223 (21.8%) 

0% 

Plausible 

406 (39.6%) 

1.2% 

Multimodal 

168 (16.4%) 

27.4% 

Uninformative 

227 (22.2%) 

- 

Rung 3 

Estimates 

6 

Secure 

329 (32.1%) 

0% 

Plausible 

324 (31.7%) 

4.9% 

Multimodal 

161 (15.7%) 

18.6% 

Uninformative 

210 (20.5%) 

- 

Rung 4 

Estimates 

6 

Secure 

235 (23.0%) 

0% 

Plausible 

430 (42.0%) 

3.5% 

Multimodal 

108 (10.5%) 

26.9% 

Uninformative 

251 (24.5%) 

- 

All Rungs 

Estimates 

6 

Secure 

1623 (31.7%) 

0% 

Plausible 

1834 (35.8%) 

2.6% 

Multimodal 

624 (12.2%) 

23.2% 

Uninformative 

1039 (20.3%) 

- 


Notes. The D3CS visual estimates for t he tim e delays are shown for the 4 
confidence categories defined in Sect. The fraction of catastrophic 
outliers is given for each rung by 6, i.e., the time-delay estimations that 
are more than 20 days away from the truth. 

the visual inspection. We developed such a method after the end 
of TDCl. The concept of the method is to estimate a time delay 
by fitting a spline on one of the two light curves, and comput¬ 
ing the residual signal of the second light curve relative to the 
spline after applying time and magnitude shifts. The details of 
the method are described in Appendix A; the present section 
evaluates its performance and compares this estimation to the 
visual time-delay values. 

We characterize the efficiency of the automated Stage I pro¬ 
cedure by comparing its fraction of catastrophic outliers e with 
that of D3CS. We define catastrophic outliers as time-delay es¬ 
timations deviating from the truth by more than 20 days, i.e. 
with \/S.ti - A^il > 20 days. The time-delay and confidence es¬ 
timation evaluated by the automated procedure are reduced to 
two numbers: the depth of the absolute minimum p and the in¬ 
terseason variations of the microlensing Figure shows the 
evolution of the fraction of catastrophic outliers 6 as a function 
of the cumulative number of time-delay estimations, sorted by 
increasing p. The larger p is, the more confident the automated 
procedure in the time-delay estimation is. We study the impact 
of the automated procedure parameters p and ^ by introducing 
three subsamples of automated estimations: 

- the Crude-all subsample contains all the estimations; 

- the Crude-Imin subsample contains only the estimations 
for which the procedure finds a unique local minimum with 
a depth p <\\ 



0 1000 2000 3000 4000 5000 

Number Of Estimations 


Fig. 2. Cumulative evolution of the fraction of catastrophic out¬ 
liers 6 (in percentage points) as a function of the number of time- 
delay estimations. To produce the plot, the curves are first sorted 
according to the depth of their absolute minimum p, indicated in 
the colour bar. Each black line (solid, dashed and dotted) repre¬ 
sents a different subsample (see text for details). The coloured 
diamonds show the value of e for the D3CS combined samples; 
the corresponding number of estimations are indicated in paren¬ 
thesis. 

- the Crude-1.5^ subsample contains only the estimations 

with a magnitude shift deviation ^ < 1.5 at the location of 

the absolute minimum p. 

Figure shows the fraction of outliers e in the three subsamples 
and compares them to the value obtained visually with D3CS, 
which are shown as four diamond-shaped symbols correspond¬ 
ing to the combin ation of the four confidence categories of D3CS 
described in Sect. 13.11 We note that the uninformative D3CS es¬ 
timations are systematically considered as catastrophic outliers 
here. 

The selection criteria applied to the Crude-Imin and 
Crude -1.5^ subsamples are not able to decrease the fraction of 
outliers. This highlights how difficult it is to find efficient selec¬ 
tion criteria for the automated procedure parameters, although 
no exhaustive exploration of the parameters space has been con¬ 
ducted. As expected, all the D3CS subsamples contain fewer out¬ 
liers than the corresponding automated procedure subsamples. 
However, the efficiency of the latter are of the same order as the 
other automated methods presented in the TDCl paper, which 
have 6 = 2-3% when half of the TDCl data, i.e. 2500 light 
curve pairs, are measured. 

In conclusion, although the automated procedure presented 
here is less complete and reliable than D3CS, it yields candidates 
that can be evaluated by eye in a second phase. Such a combined 
approach would benefit both from the speed of the automated 
method and from the flexibility of the human eye estimate when 
dealing with a broad variety of properties in the light curves. 
We note, however, that in the rest of the paper, we only use the 
results obtained via D3CS as our Stage I estimates. 

4. Stage II: measuring time deiays 

Using the Stage I results as initial estimates, we proceed in 
this section by running our PyCS time-delay measurement tech¬ 
niques on the simulated TDCl light curves. In |Tewes et al. 
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( |2013a| ), three different algorithms were proposed: the simul¬ 
taneous fit of the light curves using free-knot splines, the re¬ 
gression difference technique, and an approach based on a dis¬ 
persion measurement of which the free-knot splines and the re¬ 
gression difference technique yielded the most accurate and pre- 
cise results when applied to simulated COSMOGRAIL data (in 
Courbin et al. |2Qll[|Tewes et al.||2013bt [Eulaers et al.||2013l 
Rathna Kumar et al.|2013| ). To analyse the TDCl simulations, we 


have therefore focused on adapting only these two most promis¬ 
ing methods for an automated use. 

We note again that our Stage II methods cannot be asked to 
judge the plausibility of a proposed delay. This step belongs to 
the Stage I method, i.e. to the visual inspection with D3CS to 
prevent or at least reduce catastrophic outliers. In practice, de¬ 
spite a correct Stage I estimate, any automated Stage II method 
may fail to converge, or it may yield a time-delay measurement 
that is incompatible with the initial approximation. To prevent 
these failures from contaminating our measurements we sys¬ 
tematically discard any Stage II result that does not lie within 
1.5 D3CS uncertainty estimate of the D3CS point estimate. This 
threshold acknowledges that the uncertainty estimates obtained 
from D3CS are typically over estimated by a fac tor of 2 to 3, 
which has been confirmed by |Liao et al.| ( |2015 ). We note that 
this rejection affects less than 1% of the light curve pairs and has 
no significant influence on the / metric. 


3. We did not analyse the time-delay measurement errors on the 
simulated curves as a function of true time delay. Instead, 
only the RMS error of these time-delay measurements was 
used as our total uncertainty estimate. 

4. Finally, we did not manually fine-tune any parameters or cor¬ 
rect for problematic model fits. 

As a result, the entire spl analysis took about 5 CPU-minutes 
per TDCl pair. 

4.2. Regression difference with splines 

Our second Stage II method, sdi (for sp line difference), is ba sed 
on the regression difference technique of Tewes et al.| ( |2013^ . To 
speed up the analysis, we replace the Gaussian process regres¬ 
sions by spline fits. In summary, the method independently fits 
a different spline to each of the two light curves, and then mini¬ 
mizes the variability of the difference between these two splines 
by shifting them in time with respect to each other. The advan¬ 
tage of this approach is that it does not require an explicit mi- 
crolensing model. To estimate the uncertainty, the sdi method 
is run on the same simulated light curves provided by the spl 
technique. The sdi method has an even lower computational 
cost than spl. 


5. Results on the Time Delay Challenge 1 (TDC1) 

4.1. Free-knot spline technique data 


In the free-knot spline technique (spl), each light curve in a 
pair is modelled as the sum of a spline representing the intrin¬ 
sic variability of the quasar, common to both images of the pair, 
and an independent spline representing the extrinsic variability 
due to microlensing. The intrinsic spline has a higher density 
of knots and is therefore more flexible accomodating the quasar 
variability, which is assumed to be faster than the microlensing 
variability. During the iterative fitting process, the light curves 
are shifted in time so as to optimize the between the data 
and the model light curve. To analyse a TDCl light curve pair, 
we repeat this fit 20 times, starting form different initial condi¬ 
tions covering the Stage I uncertainty. This tests the robustness 
of the optimization procedure. The best model fit is then used 
to generate 40 simulated noisy light curves with a range of true 
time delays around the best-fit solution and using the temporal 
sampling of the original light curves. By blindly rerunning the 
spline fit on these simulated data, and comparing the resulting 
delays with the true input time delays, the delay measurement 
uncertainty is estimated. 

We simplified and automated the spl algorithm for TDCl 
with respe ct to the description o f the free-knot spline method 
given in ( [Tewes et al. 2013a| ) and its applications to real 
COSMOGRAIL data. The main adaptations are the following: 


In this section, we separately analyse results from the Stage I and 
Stage II measurement processes as obtained on the simulated 
light curves of TDCl. General results from [Liao et al.jpOlSj ) 
regarding submissions prepared with our methods include the 
following: 

1. The Stage II methods spl and sdi show no significant devi¬ 
ations of the accuracy A from zero, and can thus be consid¬ 
ered as inherently unbiased, given the statistical limits due to 
the finite challenge size. 

2. The claimed precisions P of spl and sdi are very good, with 
X^ values of the order ofxl^i - 0.5 and;^g^. ^ 1.0. 

3. Based on results from the spl method simple power-law 
models for the dependence of A, P, and / on monitoring ca¬ 
dence, season length, and campaign length were adjusted. 
These relations attempt to capture general trends regarding 
the behaviour of all methods used in the context of TDCl, 
including our spl technique. 

In the present paper , we focus on aspe cts that are complementary 
to the discussion of [Liao et~ST] ( |2015| ). 

5.1. Efficiency of time-delay discovery (Stage I) 


Liao et ak] ([2015 


1. The temporal density of spline knots controlling the flexibil¬ 
ity of the intrinsic spline was computed from the signal-to- 
noise ratios measured on the two light curves, using an em¬ 
pirical calibration. The signal-to-noise ratios were obtained 
from a structure function, by comparing the typical ampli¬ 
tude of the light curve variability observed on a timescale 
of 50 to 75 days with the scatter between temporally adja¬ 
cent observing epochs. For the microlensing spline, the knot 
density was fixed to be the same for all TDCl pairs. 

2. When generating our mock light curves, we did not inject 
any fast microlensing signal to mimic correlated noise. Only 
plain white noise was added to the generative model. 


We start by analysing the fraction of light curve pairs for which 
a time delay can be discovered with visual inspection, as a func¬ 
tion of time delay and image brightness. This analysis relates 
only to the first stage of the time-delay measurement process. 

Aside from the time delay and the quasar image brightness, 
the question of discoverability depends on observational condi¬ 
tions (e.g. monitoring cadence and duration) and on astrophys- 
ical characteristics (e.g. amount of quasar variability and mi¬ 
crolensing perturbations). In the following, we select a given ob¬ 
serving strategy and average over the astrophysical parameters 
of the TDCl simulations, which follow clearly motivated distri¬ 
butions ( Dobler et al.|2015 ). A large sample of light curve pairs 
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Fig. 3. Quantitative analysis of the discoverability of time delays through the extensive visual search with D3CS (Stage I) in the case 
of four-month observing seasons and a cadence of 3 days. The coloured tiles show the fraction of discovered delays as a function 
of the photometric precision of the fainter quasar image and the true time delay of the system. The left panel shows results from the 
very conservative sec sample, and the central panel shows the less pure seep la sample that includes delay candidates considered 
as plausible (see text). The right panel, also for seep la, doubles the number of observing seasons. In each panel, only tiles covering 
more than three light curve pairs are shown. 



P 


P 


A 


Fig. 4. Summary of metrics obtained with the Stage II algorithms spl and sdi, without any a posteriori clipping of the outliers. The 
bottom row presents enlargements taken from the panels on the upper row. The shaded regions represent the somewhat arbitrary 
target areas that were defined in the TDCl paper. 


with almost identical observing conditions can be obtained by 
merging rungs 2 and 3. These rungs share the fiducial three-day 
monitoring cadence for five seasons of four months each. The 
differing cadence dis persion of 0.0 day s for rung 2 and 1.0 days 
for rung 3 (Table 1 of [Liao et al.|2015| ) do not have a significant 
impact on the discoverability of time delays. 

In practice, time delays can be measured accurately in pairs 
of light curves if the quality of both light curves is sufficient. 
In the following we consider as a relevant observable the photo¬ 
metric precision achieved in the fainter image of a pair. This is 
computed for each pair of light curves, as the median of the pho¬ 


tometric error bars across the epochs of the TDCl simulations. 
This is made legitimate by their overall effectiveness in repre¬ 
senting the amplitude of the simulated noise, except for very 


few “evil” epochs of some systems (see Section 2.5 of Liao et al. 


2015). However, when analysing real light curves, using the pho¬ 


tometric scatter between the points might be a better choice than 
using potentially mis-estimated photometric error bars. 


Figure presents the distribution of the fraction of light 
curve pairs for which time delays could be identified via a metic¬ 
ulous D3CS visual inspection for two different monitoring strate¬ 
gies. In the left panel, only time delays categorized as secure 
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Fig. 5. Quantitative analysis of the precision achieved for the Stage II time-delay measurement as a function of the photometric 
precision of the fainter quasar image and as a function of the true time delay. All panels show results from the bias-free spl 
technique for the secure + plausible selection of rungs 2 and 3 after rejection of the catastrophic outliers (see text). The top left 
panel shows the metric P as computed using the uncertainty estimates Sf without using Atf. The top right panel shows the RMS of 
the relative point estimation residuals without considering St. The bottom left panel shows the average P obtained in each tile after 
selecting only the best half of systems according to the blind precision in P. The bottom right panel shows a map of the;^^ metric. 
In all panels, only tiles describing more than three light curve pairs are shown. 


through the visual inspection are considered as discovered. This 
is very conservative because in a real survey, simple lens models 
will help to identify the correct time delay for almost all of the 
plausible systems as well. For this reason we also consider the 
combined secpla sample, shown in the central panel. 


Some of the cases categorized as multimodal could certainly 
also be resolved using simple lens model considerations, but in 
practice the vast majority of these light curve pairs contain too 
few clear common features to estimate a reliable time delay, 
even if an approximate value would be known from the mod¬ 
elling. We therefore consider the discoverability of the secpla 
selection shown in the central panel of Fig. as roughly repre¬ 
sentative of the fraction of potentially helpful delays that can be 
reliably measured from a real monitoring campaign or survey. 
It can be seen as an approximate lower limit for the fraction of 
time delays that can be observationally constrained in the ab¬ 
sence of prior from a lens model, provided the properties of the 
microlensing signal are similar to those of the simulated light 
curves used here. Finally, the right panel shows the evolution of 
this discoverability if the same monitoring strategy is carried out 
for five more seasons, i.e., for a total of ten years. 


We observe that after five years of monitoring, light curve 
pairs with a photometric precision in the fainter image better 
than (T = 0.1 mag and a true time delay shorter than At = 80 
days (2/3 of the season length) are very likely to yield a time- 
delay measurement. Pursuing the monitoring for five more years 
significantly increases the average chances that longer delays up 
to ~ 90% of the season length become measurable. 


5.2. Precision of time-delay measurement (Stage II) 


We now turn to the analysis of the time-delay measurements 
(Stage II) for all systems where the time delay is successfully 
discovered (Stage I). 


Figure [^summarizes the results of the spl and sdi methods 
in terms of the metrics A (accuracy), P (claimed precision), and 
as define d in Sectj^ The fig ure follows the same conventions 
as Table 4 of |Liao et~aL| ( |2015| ), but also includes measurements 
obtained on the secpla samples of each rung. As expected, the 
results for these secpla samples are more scattered than for the 
sec samples. The reason for these significant offsets in A and;^^ 
with respect to the sec results is the stronger impact of outliers 
on the non-robust metrics. 
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Fig. 6. Evolution of the TDCl metrics per rung as a function of the fraction of estimations / for the spl-sec and sdi-sec samples. 
The plots are sorted by increasing values of the blind precision P (see text). Shaded regions represent the error on the metrics. Solid 
grey lines show the target values for the metrics as defined in the TDCl paper. Dashed grey lines show the best possible value 
for each metric. The bottom row presents the non-cumulative evolution of the median of the true time delays lA^il in bins of ten 
estimations. 


To characterize the achievable precision of the Stage II mea¬ 
surements without being infiuenced by catastrophic outliers but 
still benefiting from a large number of light curve pairs, we now 
focus on the seep la sample from which we remove systems 
with estimated delays that are off by more than 20 days. This 
rejects about 1% of the secpla sample. We also note that catas¬ 
trophic outliers are errors of the Stage I estimate, not Stage II. 

Figure [^presents metrics related to the Stage II time-delay 
measurement precision as a function of the photometric qual¬ 
ity of the fainter quasar light curve and the time delay. In each 
tile the top left panel shows the average claimed precision P for 
the spl technique, for a five-year monitoring with four-month 
seasons and a cadence of three days. We find that the cadence 
dispersion plays no significant role in this analysis, and we there¬ 
fore merge rungs 2 and 3 to obtain this larger sample. 

In contrast, each tile of the top right panel shows the root 
mean square (RMS) of the relative error of the time-delay es¬ 
timates Ar. The observed structure is inevitably noisier because 
this RMS is computed from the actual point estimates, while the 
precision P is based on the corresponding claimed uncertainty 
estimates. We observe both a qualitative and a quantitative sim¬ 
ilarity between these plots, suggesting that the time-delay un¬ 
certainty estimates. Si (Eq. [^, from the PyCS techniques cor¬ 
rectly capture trends related to the photometric quality and to 
the amount of temporal overlap in the light curves. 

In the lower right panel of Fig. the map of metrics for 
the spl technique shows no strong evolution across the well- 
sampled regions of the parameter space. It does however indicate 


that the uncertainty estimates St from spl are too conservative 
by a small factor of ~ ^ 1.4. This is particularly 

visible for the high quality light curves with small time delays, 
i.e. with large temporal overlaps. Finally, the bottom left panel 
shows the average P metric computed using only the best half 
of light curve pairs in each tile, where the quality of a system 
is judged via the blind relative precision Pi (see Eq. lUi. This 
operation, aimed at increasing the precision, divides by two the 
usable fraction of systems as gi ven i n Fig.[^ We consider such a 
selection in more detail in Sect. 15.31 


We observe in Fig. that the best average relative preci¬ 
sion in time-delay measurements seems to be achieved for time 
delays in the range 40-80 days for this particular monitoring 
strategy. This corresponds to about half of the season length, 
and results from the trade-off between absolute delay length and 
amount of overlap in the light curves. 


Given the observed general aptitude of our time-delay uncer¬ 
tainty estimates, and thus R, to describe the actual point estimate 
errors committed by the spl technique, and the excellent com¬ 
petitiveness of spl compared t o other time dela y measurement 
techniques (see, e.g.. Fig. 13 of |Fiao et al.|2015 ), we see the left 
panels of Fig. as roughly representative of the precision that 
can be achieved by a state-of-the-art time-delay measurement 
method. 
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Fig. 7. Evolution of the TDCl metrics with the fraction of 
estimations sorted by increasing blind precision P, for the 
spl-sec and spl-secpla samples merging all rungs. The 
spl-secpla-cut sample has been cleaned a posteriori from 
the outliers in the spl-secpla sample. In doing so, we removed 
29 curves from the spl-secpla sample. The shaded regions, the 
solid and dashed grey lines are the same as in Fig|^ 


5.3. Effect on the overall metrics of selecting the best 
systems 

In Fig. we show the evolution of the overall average metrics as 
a function of the fraction of estimations / for the spl-sec and 
sdi-sec samples. The five columns represent the five rungs and 
the estimations are sorted according to their blind precision P, i.e 
the precision estimate from the TDCl data prior to the unblind¬ 
ing of the true values. The non-cumulative median value of the 
true delays (bottom row) is computed on consecutive bins of ten 
estimations. The shaded regions represent the statistical uncer¬ 
tainties of the metrics Xen- defined in Eq|7] These uncertainties 
are too small in the top row to be distinguished from the curves. 

In the top row, P increases monotonically with /. This is ex¬ 
pected since the estimations are sorted according to P and since 
the D3CS sec sample is free of any outliers. The metrics 
A, and Aabs, respectively in the second, third and fourth rows 
stabilize around a mean value once a significant fraction of es¬ 
timations is considered. The variations around the mean such as 
the jump in;^^ at / ~ 0.05 in rung 2 are due to non-catastrophic 
outliers, i.e. time delays that deviate significantly from the truth 
but by less than 20 days. These outliers are the result of a non- 
optimal convergence of the Stage II methods for the curves with 
the lowest signal-to-noise. 

The high-/ end of A and Aabs are subject to strong variations 
in all rungs. These variations occur for small absolute values of 
the true time delay \At\. Similarly, the high-/ end of P increases 


strongly. A small error on the time-delay estimation particularly 
affects P, A, and Aabs if the true time delay is small. 

This correlation between the loss in precision and accuracy 
means that for the corresponding light curves, our algorithms es¬ 
timate the time delays less accurately, but do provide larger error 
bars. We observe that the;^^ remains constant as / increases. In 
conclusion, sorting the measurements in P and rejecting a small 
fraction of the least precise estimations allows an optimal accu¬ 
racy to be maintained without affecting the;^^. 

Figure shows the evolution of the TDCl metrics with 
the fraction of estimations, /, sorted by increasing order of 
P, for the spl-sec and spl-secpla sample. The few catas¬ 
trophic outliers result in striking discontinuities in the curves. 
Quantifying the accuracy and precision of Stage II methods is 
different from avoiding such catastrophic outliers, and to address 
the former question, we also display in Fig. a new subsam¬ 
ple, spl-secpla-cut, where the 29 time-delay estimates with 
an absolute time-delay error larger than 20 days are a posteri¬ 
ori removed. Similarly, the impact of outliers can be reduced by 
considering the median of the individual metrics instead of their 
mean. This is not surprising, but nevertheless it reflects the need 
either to use metrics that can cope with outliers or, as in our 
Stage I approach, to make sure that no outliers contaminate the 
time-delay samples used for subsequent cosmological applica¬ 
tion. 


6. Summary and conclusions 

In this work, we used the simulated light curves of the Time 
Delay Challenge 1 (TDCl) proposed by Dobler et ar] ( |2015| ) to 
test the performance of the PyCS numerical techniques currently 
in use in COSMOGRAIF to measure time delays. These meth¬ 
ods are fully data-driven, in the sense that they do not attempt to 
include any physical model for microlensing or for the intrinsic 
variations of quasars. This choice was deliberate and considers 
an empirical representation of the data that minimizes the risk 
of bias due to choosing the wrong physical model. The price to 
pay is that error bars on the measurements must be estimated 
from mocks and that we cannot use prior knowledge from ex¬ 
ternal observations of quasars in a direct formal way. Using the 
same simulated light curves, we also assessed the quantity and 
quality of the time-delay measurements from future monitoring 
campaigns or surveys. We have made public our six main TDCl 
submissions, obtained using the D3CS, spl, and sdi methods 
for the high purity secure and the less conservative plausible 
saimles. These data are available on the COSMOGRAIF web- 
sitq3 Our results can be summarized as follows: 


1. The visual estimation of time delays (Stage I) is extremely 
efficient in spotting catastrophic outliers and in providing 
useful time-delay estimates to be refined with subsequent nu¬ 
merical techniques (Stage II). 

2. We attempted to build a simple automated time-delay esti¬ 
mation procedure that we can apply to the TDCl data. While 
useful, this automated procedure does not achieve as good 
purity in the time-delay sample as the visual estimation. We 
note that we did not use this automated procedure for any of 
our submissions to the TDCl. 

3. We provide a general analysis of the achievability of time- 
delay measurements as a function of the photometric preci¬ 
sion of the light curves. In particular we show that almost 
all time delays shorter than two-thirds of the season length 

^ http://www.cosinograil.org/tdcl 
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can be measured in five years of monitoring with four-month 
seasons and realistic photometric quality. 

4. Our Stage II methods spl and sdi can be considered un¬ 
biased given the statistical limits due to the finite challenge 
size. The;^^ values are close to unity. These good results em¬ 
phasize the reliability of COSMOGRAIL time-delay mea¬ 
surements in general. 

5. We quantify the average precision on the time-delay mea¬ 
surements as a function of photometric quality of the light 
curves. We find that the best average precision seems to be 
obtained for pairs whose time delay is approximately half of 
the monitoring season length. 

6. We show that we can reliably evaluate the individual preci¬ 
sion of our time-delay estimates. This may enable us, for any 
sample of light curves, to identify which are the most valu¬ 
able objects to be followed up for cosmological purposes. 
We note, however, that any selection on the time delays in a 
quasar sample may also result in biases on the cosmological 
inference. 

The above is true for the specific light curves of TDCl. These 
curves have been generated with simple models for quasar varia¬ 
tions and microlensing and they do not include all potential nui¬ 
sances of astrophysical, atmospheric, or instrumental origin. In 
addition, the PyCS techniques currently used in COSMOGRAIL 
do not attempt to account for these effects. 
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Appendix A: Automated Stage I procedure 

If more time had been devoted to the visual inspection, we ex¬ 
pect that more correctly estimated plausible time-delay esti¬ 
mations would have been classified as secure. After the TDCl, 
we developed an automated Stage I procedure. Its goal is to 
speed up and possibly improve the quality of the D3CS output 
by providing a range of reasonable initial time delays and as¬ 
sociated confidence estimation. The following section describes 
technically how the time-delay and confidence estimation for 
this automated procedure are computed. 

A.1. Time-delay estimation 

For each pair of curves, a boun ded-optimal-knot spline ( |Molinari| 
|et al.|2004}|Tewes et al.|2013a] ) s(t) is fitted to one of the two light 
curves. The second light curve l(t) is then shifted by an amount 6t 
in time and Sm in magnitude. Thus, for a given observing epoch 
t, the value of the shifted light curve can be written as l{t - 6t)s^. 
For every value of the time and magnitude shifts, we select all the 
N points in the second light curve that overlap in time with the 
spline. For these points, we compute the residuals reSn relative 
to the spline, i.e. the difference in magnitude between the points 
and the spline. The residual reSn for point n is 

reSn = [s(t) - l(t - St)sm]n- (A.l) 



Fig.A.l. Example of time-delay estimations and confidence 
level from the automated procedure. The vertical grey dashed 
line represents the true value of the time delay. The blue dia¬ 
monds correspond to the smallest absolute average residuals n 
returned by the automated procedure. The depth of the two min¬ 
ima yUi are represented by the number below each minimum. The 
colour bar indicates the microlensing variability ^ (see text). 


We then compute the average absolute residual r(St, Sm) for 
every time and magnitude shift, i.e. 


2. The depth of each minimum jji and the absolute (i.e. the 
deepest) minimum ju, 


r(St, Sm) ■■ 


1 ^ 

-y 

Af Zj 


Iresnl 


(A.2) 


The possible presence of microlensing, assumed to be con¬ 
stant over an observing season, is handled in a very simple 
way. For each time shift Sti, we apply independent magnitude 
shifts Sm^iSti) to each season j. We define the residual curve 
r = {ri,..., Ti,..., rx) as the sum of the smallest average abso¬ 
lute residuals for each season j. The i index runs from 1 to T 
and denotes the different time-delay values Sti we want to test. 
This translates into 

Ti = ^ min ^{Sti, dmj)j. (A.3) 

j ^ 


We define ^ as a local minimum in r if 


n < n+k, for k = 1...10, 


(A.4) 


where we keep the absolute minimum in r as the final time- 
delay estimation. The k index running from 1 to 10 span s a range 
of + 20 days around each tested value ri. Figure A.l shows a 
typical residual curve r with the absolute minimum indicated 
as a coloured diamond and the true time delay indicated as a 
vertical dashed grey line. 


V — V' 

/ii = -^ , fi = minljdi], (A.5) 

(Tr 

where CTr is the standard deviation between th e elem ents in r. 
Examples of values for yUi are indicated in Fig |A.l [ below the 
minima for time shifts St ^ -120 days and dr ^ -k80 days. 

3. The total magnitude shift Sm = {dmi, ...,Sm^} and the mi¬ 
crolensing variability ^{Sti,6m), where we use the per sea¬ 
son magnitude shifts Sm-^{Sti) minimizing the average abso¬ 
lute residual n at each time shift Sti, 


Smi = ^dmj(dri) 


^(Sti,Sm) 


min [Sm] - Sm^ 

^ 6m 


(A.6) 


where osm is the standard deviation between the elements in 
Sm. In other words the quantity ^{Sti,Sm) is the difference 
between the sum of the magnitude shifts applied to each sea¬ 
son at a given time shift Sti and the minimum of this sum on 
all time shifts. This quantity follows the colour code in the 
sidebar of Fig jA.lj and is equivalent to the season-to-season 
microlensing variations minimizing the residuals for a given 
time shift. The lower this quantity is, the smaller the impact 
of microlensing is. 


A.2. Confidence estimation 

For each pair of curves, we compute three parameters related to 
the shape of the residual curve r that can be used to estimate the 
quality of the time-delay estimations: 

1. The number of local minima in r. 
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